It is important for physicians to be able to make predictions regarding a patient's risk for chronic diseases. Demographics and certain behaviors can predispose someone to certain disorders. This project aims to predict the presence of angina, stroke, and diabetes from a number of demographic and behavioral measures. Specifically, the project uses machine learning approaches both to predict which patients have chronic diseases and to narrow down which specific features are predictive of those diseases.
According to information aggregated by the CDC, in a given year, 659,000 people die from heart disease, 795,000 suffer a stroke, and more than 87,000 die from diabetes (Virani et al., 2021). Given the high prevalence of each of these conditions, it is paramount for clinicians to have models that can accurately predict whether a given patient is likely to develop any of them. The aim of this project is to determine whether risk-related behaviors are indicative of these conditions. It is hypothesized that, with the given data, machine learning classifiers will be able to accurately identify patients with chronic diseases (angina, stroke, and diabetes), and that feature selection methods (logistic regression, recursive feature elimination, random forest, etc.) will narrow down the list of important predictors for each condition.
This project uses a dataset containing categorical variables related to demographics, health-related behaviors, and history of angina, stroke, and diabetes. To determine which features are most important in predicting the chronic diseases, random forest classification, recursive feature elimination, and logistic regressions are used. Additionally, naive Bayes, linear discriminant analysis, and decision trees are used to predict whether each of the three conditions is present for a given patient. Metadata can be found in the accompanying Word file.
Let us import the needed libraries, load the data, and preprocess it as needed.
library(plotly)
library(MASS)
library(nnet)
library(caret)
library(corrplot)
library(randomForest)
library(stats)
library(tm)
library(SnowballC)
library(biclust)
library(tidyverse)
library(psych)
library(mi)
library(car)
library(gmodels)
library(e1071)
library(C50)
library(Boruta)
library(Rcpp)
# Loading the data in, changing the data to factor variables, and getting some summary information
health <- read.csv("data.csv")
summary(health)
## ID AGE_G SEX RACEGR3
## Min. : 1.0 Min. :1.00 Min. :1.000 Min. :1.000
## 1st Qu.: 250.8 1st Qu.:3.00 1st Qu.:1.000 1st Qu.:1.000
## Median : 500.5 Median :5.00 Median :2.000 Median :1.000
## Mean : 500.5 Mean :4.34 Mean :1.574 Mean :1.345
## 3rd Qu.: 750.2 3rd Qu.:6.00 3rd Qu.:2.000 3rd Qu.:1.000
## Max. :1000.0 Max. :6.00 Max. :2.000 Max. :9.000
## IMPEDUC IMPMRTL EMPLOY1 INCOMG
## Min. :2.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.000
## Median :5.000 Median :1.000 Median :7.000 Median :4.000
## Mean :4.889 Mean :2.216 Mean :5.541 Mean :4.337
## 3rd Qu.:6.000 3rd Qu.:3.000 3rd Qu.:8.000 3rd Qu.:5.000
## Max. :6.000 Max. :6.000 Max. :8.000 Max. :9.000
## CVDINFR4 CVDCRHD4 CVDSTRK3 DIABETE3
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000
## Median :2.000 Median :2.000 Median :2.000 Median :1.000
## Mean :1.996 Mean :1.813 Mean :1.905 Mean :1.487
## 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :2.000 Max. :2.000 Max. :2.000 Max. :4.000
## RFSMOK3 RFDRHV4 FRTLT1 VEGLT1
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.000
## Median :1.000 Median :1.000 Median :1.000 Median :1.000
## Mean :1.319 Mean :1.257 Mean :1.736 Mean :1.715
## 3rd Qu.:1.000 3rd Qu.:1.000 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :9.000 Max. :9.000 Max. :9.000 Max. :9.000
## TOTINDA
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.574
## 3rd Qu.:2.000
## Max. :9.000
# Changing column names to something more readable
colnames(health) <- c("ID","Age","Sex","Race","School","Marriage","Employment","Income","HeartAttack","Angina","Stroke","Diabetes",
"Smoking","Alcohol","Fruits","Vegetables","Leisure")
# Factorizing the variables
health <- subset(health, select = -ID)
for(i in 1:length(colnames(health))){
health[,i] <- factor(health[,i])
}
# Getting summaries
summary(health$Angina)
##   1   2
## 187 813
summary(health$HeartAttack)
##   1   2
##   4 996
summary(health$Stroke)
##   1   2
##  95 905
summary(health$Diabetes)
##   1   2   3   4
## 745  48 182  25
Angina, heart attacks, strokes, and diabetes are the main conditions we can predict using the risk factors in this dataset. However, only 4 people reported having previously had a heart attack, which is not enough data to make reasonable predictions. Fortunately, more people reported having the other conditions, so those can be used for further analyses.
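As a quick sanity check before modeling, the class balance of each outcome can also be expressed as proportions (a small sketch against the renamed `health` data frame defined above):

```r
# Share of respondents at each level of the four outcome variables;
# heavy imbalance here will matter for the classifiers below
outcomes <- c("HeartAttack", "Angina", "Stroke", "Diabetes")
lapply(health[outcomes], function(v) round(prop.table(table(v)), 3))
```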
Now, we will use naive Bayes, linear discriminant analysis, and decision trees to predict the aforementioned health conditions.
## Prediction Models (Bayes, LDA, Decision Trees)
healthSubset <- sample(nrow(health),floor(nrow(health)*.60))
healthTrain <- health[healthSubset, ]
healthTest <- health[-healthSubset, ]
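Note that the `sample()` split above is unseeded, so results will vary from run to run. A reproducible, class-stratified alternative is sketched below using caret's `createDataPartition` (the seed value and the `healthTrain2`/`healthTest2` names are illustrative):

```r
# Stratified 60/40 split: preserves the Angina class proportions in both
# partitions and is reproducible across runs
set.seed(42)
trainIdx <- createDataPartition(health$Angina, p = 0.60, list = FALSE)
healthTrain2 <- health[trainIdx, ]
healthTest2  <- health[-trainIdx, ]
```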
anginaBayesModel <- naiveBayes(healthTrain, healthTrain$Angina, type = 'class')
anginaPred <- predict(anginaBayesModel, healthTest)
anginaBayesCT <- CrossTable(anginaPred, healthTest$Angina)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Angina
## anginaPred | 1 | 2 | Row Total |
## -------------|-----------|-----------|-----------|
## 1 | 82 | 0 | 82 |
## | 252.810 | 65.190 | |
## | 1.000 | 0.000 | 0.205 |
## | 1.000 | 0.000 | |
## | 0.205 | 0.000 | |
## -------------|-----------|-----------|-----------|
## 2 | 0 | 318 | 318 |
## | 65.190 | 16.810 | |
## | 0.000 | 1.000 | 0.795 |
## | 0.000 | 1.000 | |
## | 0.000 | 0.795 | |
## -------------|-----------|-----------|-----------|
## Column Total | 82 | 318 | 400 |
## | 0.205 | 0.795 | |
## -------------|-----------|-----------|-----------|
##
##
# Plotting confusion matrix
plot_ly(x = c("TN", "FN", "FP", "TP"),
y = c(anginaBayesCT$prop.row[1,1], anginaBayesCT$prop.row[1,2], anginaBayesCT$prop.row[2,1], anginaBayesCT$prop.row[2,2]),
name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>%
layout(title="Confusion Matrix (Angina Prediction)",
legend=list(title=list(text='<b> Metrics </b>')), yaxis=list(title='Probability'))

strokeBayesModel <- naiveBayes(healthTrain, healthTrain$Stroke, type = 'class')
strokePred <- predict(strokeBayesModel, healthTest)
strokeBayesCT <- CrossTable(strokePred, healthTest$Stroke)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Stroke
## strokePred | 1 | 2 | Row Total |
## -------------|-----------|-----------|-----------|
## 1 | 41 | 0 | 41 |
## | 322.202 | 36.797 | |
## | 1.000 | 0.000 | 0.102 |
## | 1.000 | 0.000 | |
## | 0.102 | 0.000 | |
## -------------|-----------|-----------|-----------|
## 2 | 0 | 359 | 359 |
## | 36.797 | 4.203 | |
## | 0.000 | 1.000 | 0.897 |
## | 0.000 | 1.000 | |
## | 0.000 | 0.897 | |
## -------------|-----------|-----------|-----------|
## Column Total | 41 | 359 | 400 |
## | 0.102 | 0.897 | |
## -------------|-----------|-----------|-----------|
##
##
# Plotting confusion matrix
plot_ly(x = c("TN", "FN", "FP", "TP"),
y = c(strokeBayesCT$prop.row[1,1], strokeBayesCT$prop.row[1,2], strokeBayesCT$prop.row[2,1], strokeBayesCT$prop.row[2,2]),
name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>%
layout(title="Confusion Matrix (Stroke Prediction)",
legend=list(title=list(text='<b> Metrics </b>')), yaxis=list(title='Probability'))

diabetesBayesModel <- naiveBayes(healthTrain, healthTrain$Diabetes, type = 'class')
diabetesPred <- predict(diabetesBayesModel, healthTest)
strokeDiabetesCT <- CrossTable(diabetesPred, healthTest$Diabetes)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Diabetes
## diabetesPred | 1 | 2 | 3 | 4 | Row Total |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 306 | 3 | 0 | 0 | 309 |
## | 20.502 | 11.525 | 49.440 | 6.180 | |
## | 0.990 | 0.010 | 0.000 | 0.000 | 0.772 |
## | 1.000 | 0.136 | 0.000 | 0.000 | |
## | 0.765 | 0.007 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 2 | 0 | 19 | 0 | 0 | 19 |
## | 14.535 | 308.500 | 3.040 | 0.380 | |
## | 0.000 | 1.000 | 0.000 | 0.000 | 0.048 |
## | 0.000 | 0.864 | 0.000 | 0.000 | |
## | 0.000 | 0.048 | 0.000 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 3 | 0 | 0 | 64 | 0 | 64 |
## | 48.960 | 3.520 | 282.240 | 1.280 | |
## | 0.000 | 0.000 | 1.000 | 0.000 | 0.160 |
## | 0.000 | 0.000 | 1.000 | 0.000 | |
## | 0.000 | 0.000 | 0.160 | 0.000 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## 4 | 0 | 0 | 0 | 8 | 8 |
## | 6.120 | 0.440 | 1.280 | 384.160 | |
## | 0.000 | 0.000 | 0.000 | 1.000 | 0.020 |
## | 0.000 | 0.000 | 0.000 | 1.000 | |
## | 0.000 | 0.000 | 0.000 | 0.020 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 306 | 22 | 64 | 8 | 400 |
## | 0.765 | 0.055 | 0.160 | 0.020 | |
## -------------|-----------|-----------|-----------|-----------|-----------|
##
##
plot_ly(x = c("TN", "FN", "FP", "TP"),
y = c(strokeDiabetesCT$prop.row[1,1], strokeDiabetesCT$prop.row[1,2], strokeDiabetesCT$prop.row[2,1], strokeDiabetesCT$prop.row[2,2]),
name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>%
layout(title="Confusion Matrix (Diabetes Prediction)",
legend=list(title=list(text='<b> Metrics </b>')), yaxis=list(title='Probability'))

Across the board, the naive Bayes approach is successful in predicting these medical conditions, with high true-positive and true-negative rates. Let's look at linear discriminant analysis next.
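One caveat: `naiveBayes()` above was given the full training frame, so each outcome column sits among its own predictors, which likely explains the perfect-looking cross-tables. A leakage-free variant drops the outcome from the predictor set before fitting (a sketch; `anginaBayesNoLeak` is an illustrative name):

```r
# Fit on predictors only, so the model cannot simply read the label it
# is asked to predict
anginaBayesNoLeak <- naiveBayes(subset(healthTrain, select = -Angina),
                                healthTrain$Angina)
anginaPredNoLeak <- predict(anginaBayesNoLeak, subset(healthTest, select = -Angina))
CrossTable(anginaPredNoLeak, healthTest$Angina)
```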
ldaAngina <- lda(data=healthTrain, Angina~.)
ldaAnginaPred <- predict(ldaAngina,healthTest)
anginaCT <- CrossTable(ldaAnginaPred$class, healthTest$Angina)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Angina
## ldaAnginaPred$class | 1 | 2 | Row Total |
## --------------------|-----------|-----------|-----------|
## 1 | 61 | 1 | 62 |
## | 183.472 | 47.310 | |
## | 0.984 | 0.016 | 0.155 |
## | 0.744 | 0.003 | |
## | 0.152 | 0.002 | |
## --------------------|-----------|-----------|-----------|
## 2 | 21 | 317 | 338 |
## | 33.655 | 8.678 | |
## | 0.062 | 0.938 | 0.845 |
## | 0.256 | 0.997 | |
## | 0.052 | 0.792 | |
## --------------------|-----------|-----------|-----------|
## Column Total | 82 | 318 | 400 |
## | 0.205 | 0.795 | |
## --------------------|-----------|-----------|-----------|
##
##
# Plotting confusion matrix
plot_ly(x = c("TN", "FN", "FP", "TP"),
y = c(anginaCT$prop.row[1,1], anginaCT$prop.row[1,2], anginaCT$prop.row[2,1], anginaCT$prop.row[2,2]),
name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>%
layout(title="Confusion Matrix (Angina Predictions)",
legend=list(title=list(text='<b> Metrics </b>')), yaxis=list(title='Probability'))

ldaStroke <- lda(data=healthTrain, Stroke~.)
ldaStrokePred <- predict(ldaStroke,healthTest)
strokeCT <- CrossTable(ldaStrokePred$class, healthTest$Stroke)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Stroke
## ldaStrokePred$class | 1 | 2 | Row Total |
## --------------------|-----------|-----------|-----------|
## 1 | 36 | 9 | 45 |
## | 213.588 | 24.393 | |
## | 0.800 | 0.200 | 0.112 |
## | 0.878 | 0.025 | |
## | 0.090 | 0.022 | |
## --------------------|-----------|-----------|-----------|
## 2 | 5 | 350 | 355 |
## | 27.075 | 3.092 | |
## | 0.014 | 0.986 | 0.887 |
## | 0.122 | 0.975 | |
## | 0.012 | 0.875 | |
## --------------------|-----------|-----------|-----------|
## Column Total | 41 | 359 | 400 |
## | 0.102 | 0.897 | |
## --------------------|-----------|-----------|-----------|
##
##
# Plotting confusion matrix
plot_ly(x = c("TN", "FN", "FP", "TP"),
y = c(strokeCT$prop.row[1,1], strokeCT$prop.row[1,2], strokeCT$prop.row[2,1], strokeCT$prop.row[2,2]),
name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>%
layout(title="Confusion Matrix (Stroke Predictions)",
legend=list(title=list(text='<b> Metrics </b>')), yaxis=list(title='Probability'))

ldaDiabetes <- lda(data=healthTrain, Diabetes~.)
ldaDiabetesPred <- predict(ldaDiabetes,healthTest)
diabetesCT <- CrossTable(ldaDiabetesPred$class, healthTest$Diabetes)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 400
##
##
## | healthTest$Diabetes
## ldaDiabetesPred$class | 1 | 2 | 3 | 4 | Row Total |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## 1 | 283 | 20 | 15 | 4 | 322 |
## | 5.459 | 0.296 | 25.887 | 0.924 | |
## | 0.879 | 0.062 | 0.047 | 0.012 | 0.805 |
## | 0.925 | 0.909 | 0.234 | 0.500 | |
## | 0.708 | 0.050 | 0.037 | 0.010 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## 2 | 5 | 1 | 2 | 0 | 8 |
## | 0.205 | 0.713 | 0.405 | 0.160 | |
## | 0.625 | 0.125 | 0.250 | 0.000 | 0.020 |
## | 0.016 | 0.045 | 0.031 | 0.000 | |
## | 0.012 | 0.002 | 0.005 | 0.000 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## 3 | 16 | 1 | 45 | 4 | 66 |
## | 23.560 | 1.905 | 112.321 | 5.441 | |
## | 0.242 | 0.015 | 0.682 | 0.061 | 0.165 |
## | 0.052 | 0.045 | 0.703 | 0.500 | |
## | 0.040 | 0.002 | 0.112 | 0.010 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## 4 | 2 | 0 | 2 | 0 | 4 |
## | 0.367 | 0.220 | 2.890 | 0.080 | |
## | 0.500 | 0.000 | 0.500 | 0.000 | 0.010 |
## | 0.007 | 0.000 | 0.031 | 0.000 | |
## | 0.005 | 0.000 | 0.005 | 0.000 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
## Column Total | 306 | 22 | 64 | 8 | 400 |
## | 0.765 | 0.055 | 0.160 | 0.020 | |
## ----------------------|-----------|-----------|-----------|-----------|-----------|
##
##
# Plotting confusion matrix
plot_ly(x = c("TN", "FN", "FP", "TP"),
y = c(diabetesCT$prop.row[1,1], diabetesCT$prop.row[1,2], diabetesCT$prop.row[2,1], diabetesCT$prop.row[2,2]),
name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>%
layout(title="Confusion Matrix (Diabetes Predictions)",
legend=list(title=list(text='<b> Metrics </b>')), yaxis=list(title='Probability'))

LDA performs well for angina, with high true-negative and true-positive rates. Stroke predictions are also good, though not quite as good as those for angina. Diabetes fares much worse, with many false positives and virtually no true positives. Let's now look at decision trees.
# A helper function to pull the needed values from the confusion matrices
calculate_metrics_percentage <- function(conf_matrix) {
total_positive <- sum(conf_matrix[1,])
total_negative <- sum(conf_matrix[2,])
TP <- conf_matrix[1, 1] / total_positive * 100
TN <- conf_matrix[2, 2] / total_negative * 100
FP <- conf_matrix[1, 2] / total_positive * 100
FN <- conf_matrix[2, 1] / total_negative * 100
return(list(TP = TP, TN = TN, FP = FP, FN = FN))
}

DecisionTreeAngina <- C5.0(healthTrain[,-9], healthTrain$Angina)
DTAPred <- predict(DecisionTreeAngina,healthTest[,-9])
DTACT <- confusionMatrix(table(DTAPred,healthTest$Angina))
DTA_result <- calculate_metrics_percentage(DTACT$table)
plot_ly(x = c("TN", "FN", "FP", "TP"),
y = c(DTA_result$TN, DTA_result$FN, DTA_result$FP, DTA_result$TP),
name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>%
layout(title="Confusion Matrix (Angina Predictions)",
legend=list(title=list(text='<b> Metrics </b>')), yaxis=list(title='Probability'))

DecisionTreeStroke <- C5.0(healthTrain[,-10], healthTrain$Stroke)
DTSPred <- predict(DecisionTreeStroke,healthTest[,-10])
DTSCT <- confusionMatrix(table(DTSPred,healthTest$Stroke))
DTS_result <- calculate_metrics_percentage(DTSCT$table)
plot_ly(x = c("TN", "FN", "FP", "TP"),
y = c(DTS_result$TN, DTS_result$FN, DTS_result$FP, DTS_result$TP),
name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>%
layout(title="Confusion Matrix (Stroke Predictions)",
legend=list(title=list(text='<b> Metrics </b>')), yaxis=list(title='Probability'))

DecisionTreeDiabetes <- C5.0(healthTrain[,-11], healthTrain$Diabetes)
DTDPred <- predict(DecisionTreeDiabetes, healthTest[,-11])
DTDCT <-confusionMatrix(table(DTDPred,healthTest$Diabetes))
DTD_result <- calculate_metrics_percentage(DTDCT$table)
plot_ly(x = c("TN", "FN", "FP", "TP"),
y = c(DTD_result$TN, DTD_result$FN, DTD_result$FP, DTD_result$TP),
name = c("TN", "FN", "FP", "TP"), type = "bar", color=c("TN", "FN", "FP", "TP")) %>%
layout(title="Confusion Matrix (Diabetes Predictions)",
legend=list(title=list(text='<b> Metrics </b>')), yaxis=list(title='Probability'))

Naive Bayes, LDA, and decision trees achieve high true-negative and true-positive scores for predicting angina, stroke, and diabetes. Now that we know some supervised classifiers can accurately predict these three health complications, it's natural to wonder which features are the important predictors.
I will incorporate three well-known approaches to feature selection: logistic regression, recursive feature elimination, and random forests (via the Boruta algorithm).
## Logistic Regressions
anginaLM <- glm(Angina ~ ., data = health, family = "binomial")
summary(anginaLM)
##
## Call:
## glm(formula = Angina ~ ., family = "binomial", data = health)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.4759 0.0000 0.0000 0.0462 3.4413
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.679e+01 1.318e+04 0.001 0.998983
## Age2 6.230e+00 1.263e+00 4.931 8.16e-07 ***
## Age3 9.018e+00 1.481e+00 6.087 1.15e-09 ***
## Age4 1.199e+01 1.666e+00 7.195 6.25e-13 ***
## Age5 1.622e+01 2.141e+00 7.573 3.63e-14 ***
## Age6 4.158e+01 2.363e+03 0.018 0.985963
## Sex2 -1.829e+00 5.013e-01 -3.649 0.000263 ***
## Race2 -9.829e-01 9.167e-01 -1.072 0.283609
## Race3 -8.209e-01 1.192e+00 -0.689 0.491064
## Race4 6.718e+00 6.856e+00 0.980 0.327136
## Race5 -2.352e+00 2.023e+00 -1.162 0.245042
## Race9 -6.370e-02 1.275e+00 -0.050 0.960150
## School3 -1.737e+01 1.318e+04 -0.001 0.998948
## School4 -2.082e+01 1.318e+04 -0.002 0.998739
## School5 -2.321e+01 1.318e+04 -0.002 0.998595
## School6 -2.457e+01 1.318e+04 -0.002 0.998512
## Marriage2 -2.785e-01 5.513e-01 -0.505 0.613454
## Marriage3 3.870e-01 6.845e-01 0.565 0.571852
## Marriage4 -1.546e+01 1.601e+03 -0.010 0.992298
## Marriage5 3.313e-01 6.096e-01 0.543 0.586829
## Marriage6 2.449e-02 9.976e-01 0.025 0.980418
## Employment2 2.813e+00 2.500e+00 1.125 0.260580
## Employment3 -1.959e+00 1.462e+00 -1.340 0.180261
## Employment4 -3.354e+00 1.176e+00 -2.851 0.004355 **
## Employment5 -7.093e-01 1.115e+00 -0.636 0.524676
## Employment6 -1.641e+00 1.697e+00 -0.967 0.333564
## Employment7 -9.317e-01 6.465e-01 -1.441 0.149553
## Employment8 -1.028e+00 5.656e-01 -1.817 0.069238 .
## Income2 -2.436e-01 8.518e-01 -0.286 0.774941
## Income3 1.170e-04 9.685e-01 0.000 0.999904
## Income4 -6.148e-01 7.638e-01 -0.805 0.420811
## Income5 4.917e-01 6.980e-01 0.704 0.481196
## Income9 1.396e+00 8.940e-01 1.561 0.118455
## HeartAttack2 -2.684e+00 1.753e+00 -1.531 0.125794
## Stroke2 6.602e-01 9.955e-01 0.663 0.507247
## Diabetes2 -1.168e+00 7.027e-01 -1.661 0.096628 .
## Diabetes3 1.436e+01 2.447e+03 0.006 0.995319
## Diabetes4 1.155e+01 6.140e+03 0.002 0.998499
## Smoking2 4.874e+00 9.731e-01 5.009 5.48e-07 ***
## Smoking9 2.076e+00 1.333e+00 1.558 0.119241
## Alcohol2 4.398e+00 1.745e+00 2.520 0.011745 *
## Alcohol9 1.495e+00 1.291e+00 1.158 0.246773
## Fruits2 1.612e+00 4.979e-01 3.238 0.001205 **
## Fruits9 -1.226e+00 1.031e+00 -1.189 0.234391
## Vegetables2 2.313e+00 6.204e-01 3.728 0.000193 ***
## Vegetables9 -4.775e-01 7.541e-01 -0.633 0.526553
## Leisure2 2.131e+00 6.307e-01 3.379 0.000728 ***
## Leisure9 -2.230e+00 1.373e+00 -1.624 0.104401
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 963.69 on 999 degrees of freedom
## Residual deviance: 188.18 on 952 degrees of freedom
## AIC: 284.18
##
## Number of Fisher Scoring iterations: 21
x <- summary(anginaLM)$coefficients[,4] < .05
anginaLMSig <- names(which(x)) # Saving the significant variables for later comparison across feature selection methods
strokeLM <- glm(Stroke ~ ., data = health, family = "binomial")
summary(strokeLM)
##
## Call:
## glm(formula = Stroke ~ ., family = "binomial", data = health)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.701 0.000 0.000 0.000 2.397
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 6.575e+00 2.341e+04 0.000 0.999776
## Age2 6.692e+00 1.365e+00 4.903 9.42e-07 ***
## Age3 1.482e+01 3.009e+00 4.927 8.37e-07 ***
## Age4 3.289e+01 4.074e+03 0.008 0.993559
## Age5 3.551e+01 3.309e+03 0.011 0.991439
## Age6 3.436e+01 2.757e+03 0.012 0.990057
## Sex2 3.718e-01 7.262e-01 0.512 0.608654
## Race2 -9.632e-02 2.671e+00 -0.036 0.971234
## Race3 6.705e+00 2.715e+00 2.470 0.013516 *
## Race4 2.212e-01 5.137e+01 0.004 0.996564
## Race5 1.586e+00 3.227e+00 0.491 0.623205
## Race9 -2.058e+00 1.565e+00 -1.315 0.188573
## School3 -6.445e+00 2.341e+04 0.000 0.999780
## School4 -1.063e+01 2.341e+04 0.000 0.999638
## School5 -1.240e+01 2.341e+04 -0.001 0.999577
## School6 -1.665e+01 2.341e+04 -0.001 0.999433
## Marriage2 4.557e-02 1.047e+00 0.044 0.965298
## Marriage3 -5.448e-01 9.836e-01 -0.554 0.579645
## Marriage4 3.147e+00 8.088e+00 0.389 0.697185
## Marriage5 -2.088e+00 1.072e+00 -1.948 0.051375 .
## Marriage6 3.642e+00 2.224e+00 1.638 0.101523
## Employment2 2.135e-01 5.747e+00 0.037 0.970358
## Employment3 6.420e+00 2.168e+00 2.961 0.003069 **
## Employment4 -6.157e+00 2.513e+00 -2.450 0.014279 *
## Employment5 6.330e+00 1.988e+00 3.184 0.001451 **
## Employment6 1.814e+01 1.565e+04 0.001 0.999075
## Employment7 2.207e+00 1.160e+00 1.903 0.056992 .
## Employment8 2.405e+00 9.568e-01 2.514 0.011938 *
## Income2 -4.978e+00 1.847e+00 -2.695 0.007041 **
## Income3 -3.992e+00 1.894e+00 -2.107 0.035098 *
## Income4 -5.276e+00 1.860e+00 -2.836 0.004561 **
## Income5 -4.874e+00 1.648e+00 -2.958 0.003093 **
## Income9 -2.482e+00 1.597e+00 -1.554 0.120079
## HeartAttack2 1.979e+00 1.932e+00 1.025 0.305522
## Angina2 4.649e-01 1.282e+00 0.362 0.716981
## Diabetes2 -1.031e+00 1.314e+00 -0.785 0.432664
## Diabetes3 1.084e+01 3.299e+03 0.003 0.997379
## Diabetes4 -1.419e+00 1.168e+04 0.000 0.999903
## Smoking2 5.190e+00 1.500e+00 3.460 0.000541 ***
## Smoking9 -3.707e+00 1.965e+00 -1.887 0.059168 .
## Alcohol2 -1.583e+00 1.804e+00 -0.877 0.380374
## Alcohol9 2.200e-01 6.036e+00 0.036 0.970929
## Fruits2 1.399e+00 7.798e-01 1.794 0.072844 .
## Fruits9 -4.617e-01 1.680e+00 -0.275 0.783448
## Vegetables2 3.971e+00 1.116e+00 3.557 0.000375 ***
## Vegetables9 -1.493e+00 1.263e+00 -1.182 0.237295
## Leisure2 3.652e+00 1.050e+00 3.478 0.000504 ***
## Leisure9 -1.794e+00 1.808e+00 -0.992 0.321001
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 627.912 on 999 degrees of freedom
## Residual deviance: 88.892 on 952 degrees of freedom
## AIC: 184.89
##
## Number of Fisher Scoring iterations: 22
x <- summary(strokeLM)$coefficients[,4] < .05
strokeLMSig <- names(which(x)) # Saving the significant variables for later comparison across feature selection methods
# Diabetes has more than two levels, so a multinomial logistic regression is needed
diabetesLM <- multinom(Diabetes ~ ., data = health)
## # weights: 188 (138 variable)
## initial value 1386.294361
## iter 10 value 485.423916
## iter 20 value 400.229607
## iter 30 value 391.035328
## iter 40 value 389.248554
## iter 50 value 388.918363
## iter 60 value 388.755521
## iter 70 value 388.588214
## iter 80 value 388.554762
## iter 90 value 388.456470
## iter 100 value 388.446908
## final value 388.446908
## stopped after 100 iterations
## Call:
## multinom(formula = Diabetes ~ ., data = health)
##
## Coefficients:
## (Intercept) Age2 Age3 Age4 Age5 Age6 Sex2
## 2 -23.845640 0.228492 0.1890488 1.102002 -0.1193996 0.1000789 12.2415970
## 3 -10.123903 7.526140 8.3406602 10.293995 11.9331364 13.3489896 -0.8916207
## 4 -6.205156 -4.471158 8.2636577 10.222035 10.7712852 12.8469054 -0.7550098
## Race2 Race3 Race4 Race5 Race9 School3
## 2 1.3270821 -10.911943 -10.1172718 -10.23349935 2.448592 -13.897770
## 3 0.3640057 -2.268112 0.1320356 -0.01904884 2.338475 -4.437903
## 4 -12.3052437 -17.965338 -7.2198419 -9.19831731 4.208112 -8.111618
## School4 School5 School6 Marriage2 Marriage3 Marriage4
## 2 0.5390254 0.6464011 0.8929987 -0.6486809 -1.623709531 0.18674280
## 3 -6.2207739 -8.9783005 -11.6298273 -0.1458774 -0.001513464 -0.07101547
## 4 -8.6991456 -10.5154197 -14.3019830 -0.8386655 -0.452226177 -13.82239802
## Marriage5 Marriage6 Employment2 Employment3 Employment4 Employment5
## 2 -0.54514462 -1.33242728 0.010669566 0.3828141 -10.066596 0.3708215
## 3 -0.94079389 -0.07848923 -0.003073865 -1.1689414 1.254144 0.6417032
## 4 0.01474081 -10.15974267 -10.079925026 1.6887563 -9.293229 0.4854404
## Employment6 Employment7 Employment8 Income2 Income3 Income4
## 2 1.7835135 -0.3997252 0.6013019 -0.40677611 -2.4787821 0.1220308
## 3 0.5621604 0.1089195 -0.4876101 -0.04393232 0.5190654 -0.4356655
## 4 -9.4477142 0.5840835 -0.4333128 0.76629283 1.2107869 0.6719532
## Income5 Income9 HeartAttack2 Angina2 Stroke2 Smoking2
## 2 -0.4871224 -0.05776236 7.0416523 -0.7617192 1.8094802 0.2649059
## 3 -0.3083258 0.34726289 -0.7082079 3.5307840 0.2570182 3.2793888
## 4 1.4394547 1.63426503 -3.0432397 4.7325083 -2.4347133 3.5799597
## Smoking9 Alcohol2 Alcohol9 Fruits2 Fruits9 Vegetables2 Vegetables9
## 2 0.3495308 -0.7915249 -1.3527356 0.1618617 1.0243905 0.7599804 0.8458477
## 3 0.2566212 2.4709877 0.9318741 1.4147436 0.4395236 1.1527861 0.7630708
## 4 -10.2700939 2.3654527 -9.7233765 1.2501587 0.7499800 0.9894764 -10.5734441
## Leisure2 Leisure9
## 2 0.7770696 -0.8974989
## 3 2.4250543 -0.5097800
## 4 1.4348926 -0.2750364
##
## Std. Errors:
## (Intercept) Age2 Age3 Age4 Age5 Age6 Sex2
## 2 111.12640 1.107685 1.169325 1.208408 1.273265 1.268669 64.6700930
## 3 96.20731 92.981636 92.980287 92.979495 92.979599 92.979702 0.2752929
## 4 127.19162 74.372250 123.056849 123.063549 123.064328 123.064456 0.5258059
## Race2 Race3 Race4 Race5 Race9 School3 School4
## 2 0.5264059 1.642437e+02 180.365712 173.269557 1.034636 0.2920375 47.893981
## 3 0.5068182 1.222413e+00 1.049532 1.231685 1.105757 2.6216619 2.591366
## 4 176.6119494 9.528947e-05 145.206562 206.141227 1.265192 2.9627181 2.815176
## School5 School6 Marriage2 Marriage3 Marriage4 Marriage5 Marriage6
## 2 47.894122 47.893165 0.5516379 0.7974547 1.3808488802 0.5229251 1.1805731
## 3 2.634921 2.696002 0.4230981 0.3987608 1.5308131868 0.4095321 0.7007587
## 4 2.870052 3.194266 0.9117707 0.8174344 0.0002699382 0.6715403 144.7473168
## Employment2 Employment3 Employment4 Employment5 Employment6 Employment7
## 2 0.9569608 1.197182 193.653144 0.9093629 1.0911126 0.5920207
## 3 0.6754031 1.328496 1.041996 0.7145045 0.8090641 0.3882478
## 4 157.8802172 1.577241 233.680212 1.2734154 205.8712207 0.6650961
## Employment8 Income2 Income3 Income4 Income5 Income9 HeartAttack2
## 2 0.4538096 0.6316236 1.1421688 0.6147420 0.5590656 0.6638731 76.63196
## 3 0.3373228 0.5470485 0.5594095 0.5348256 0.4963819 0.5807680 18.70123
## 4 0.6326563 1.3110992 1.4828685 1.3040046 1.2511020 1.3597623 17.49951
## Angina2 Stroke2 Smoking2 Smoking9 Alcohol2 Alcohol9 Fruits2
## 2 0.5709167 1.086667 0.5347810 0.9680688 1.1686414 1.3473229 0.4018891
## 3 15.5643487 13.845363 0.3998010 0.9646119 0.4967988 0.8098973 0.2990350
## 4 63.8326779 36.321014 0.6515794 176.5866845 0.9322985 169.0603771 0.5310582
## Fruits9 Vegetables2 Vegetables9 Leisure2 Leisure9
## 2 0.5891724 0.4246063 0.6591481 0.4300163 1.1801469
## 3 0.6486549 0.3050981 0.5492398 0.3361840 0.6796632
## 4 1.2982978 0.5659303 172.8184802 0.6016299 1.2437793
##
## Residual Deviance: 776.8938
## AIC: 1052.894
z <- summary(diabetesLM)$coefficients/summary(diabetesLM)$standard.errors #Getting z-scores
p <- (1 - pnorm(abs(z),0,1))*2 ## Getting the p-values
p
## (Intercept) Age2 Age3 Age4 Age5 Age6 Sex2
## 2 0.8300938 0.8365730 0.8715631 0.3617975 0.9252884 0.9371241 0.849863168
## 3 0.9161932 0.9354879 0.9285228 0.9118443 0.8978788 0.8858409 0.001200309
## 4 0.9610899 0.9520612 0.9464598 0.9338014 0.9302537 0.9168585 0.151028005
## Race2 Race3 Race4 Race5 Race9 School3
## 2 0.01170138 0.94702951 0.9552676 0.9529034 0.0179512756 0.000000000
## 3 0.47262325 0.06353354 0.8998868 0.9876607 0.0344455047 0.090496934
## 4 0.94445321 0.00000000 0.9603446 0.9644091 0.0008808087 0.006183338
## School4 School5 School6 Marriage2 Marriage3 Marriage4
## 2 0.991020355 0.9892317103 9.851238e-01 0.2396277 0.04173881 0.8924240
## 3 0.016369252 0.0006557862 1.605218e-05 0.7302569 0.99697170 0.9629989
## 4 0.002000961 0.0002484598 7.556029e-06 0.3576665 0.58010847 0.0000000
## Marriage5 Marriage6 Employment2 Employment3 Employment4 Employment5
## 2 0.29718413 0.2590550 0.9911042 0.7491483 0.9585426 0.6834340
## 3 0.02160503 0.9108186 0.9963687 0.3789144 0.2287452 0.3691272
## 4 0.98248725 0.9440428 0.9490933 0.2843032 0.9682773 0.7030464
## Employment6 Employment7 Employment8 Income2 Income3 Income4 Income5
## 2 0.1021366 0.4995564 0.1851681 0.5195646 0.02998857 0.8426481 0.3835821
## 3 0.4871631 0.7790623 0.1483092 0.9359924 0.35346933 0.4153050 0.5345033
## 4 0.9633968 0.3798384 0.4934006 0.5589069 0.41420468 0.6063436 0.2499176
## Income9 HeartAttack2 Angina2 Stroke2 Smoking2 Smoking9
## 2 0.9306650 0.9267860 0.1821371 0.09587978 6.203503e-01 0.7180547
## 3 0.5498817 0.9697917 0.8205398 0.98518933 2.220446e-16 0.7902117
## 4 0.2294117 0.8619407 0.9408996 0.94655528 3.923081e-08 0.9536220
## Alcohol2 Alcohol9 Fruits2 Fruits9 Vegetables2 Vegetables9
## 2 4.982134e-01 0.3153702 6.871306e-01 0.08208862 0.0734784985 0.1994067
## 3 6.564649e-07 0.2498937 2.233832e-06 0.49803016 0.0001578317 0.1647350
## 4 1.117346e-02 0.9541356 1.856812e-02 0.56349094 0.0803935932 0.9512140
## Leisure2 Leisure9
## 2 7.075133e-02 0.4469572
## 3 5.453415e-13 0.4532258
## 4 1.707871e-02 0.8249915
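Rather than scanning the p-value matrix above by eye, the significant coefficients can be pulled out programmatically (a sketch using the `p` matrix just computed; the .05 threshold and the `diabetesLMSig` name are illustrative, matching the logistic models):

```r
# which() with arr.ind returns the (outcome level, coefficient) position
# of every p-value below .05; keep the unique coefficient names
sig <- which(p < .05, arr.ind = TRUE)
diabetesLMSig <- unique(colnames(p)[sig[, "col"]])
diabetesLMSig
```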
## Feature Selection using Random Forest
BorutaAngina <- Boruta(Angina ~., data = health, doTrace=0)
BorutaAngina
## Boruta performed 99 iterations in 12.68731 secs.
## 8 attributes confirmed important: Age, Alcohol, Diabetes, Leisure,
## School and 3 more;
## 6 attributes confirmed unimportant: Employment, HeartAttack, Income,
## Marriage, Race and 1 more;
## 1 tentative attributes left: Fruits;
df_long <- tidyr::gather(as.data.frame(BorutaAngina$ImpHistory), feature, measurement)
plot_ly(df_long, y = ~measurement, color = ~feature, type = "box") %>%
layout(title="Box-and-whisker Plots across all Features",
xaxis = list(title="Features"),
yaxis = list(title="Importance"),
       showlegend=F)
BorutaAnginaVars <- getSelectedAttributes(BorutaAngina)
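Boruta left Fruits tentative. As an optional post-processing step (a sketch, not part of the original analysis), Boruta's rough fix resolves tentative attributes by comparing their median importance against the best shadow attribute:

```r
# Resolve the tentative attribute ("Fruits") via Boruta's rough fix.
BorutaAnginaFixed <- TentativeRoughFix(BorutaAngina)
getSelectedAttributes(BorutaAnginaFixed)
```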
# Recursive Feature selection
control<-rfeControl(functions = rfFuncs, method = "cv", number=10)
rf.trainAngina <- rfe(healthTrain[,-9], healthTrain[,9],sizes=c(10, 15), rfeControl=control)
rf.trainAngina
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 10 0.9317 0.7474 0.02126 0.0781 *
## 15 0.9118 0.6443 0.02683 0.1191
##
## The top 5 variables (out of 10):
## Age, Stroke, School, Smoking, Sex
Let's see which variables the feature selection methods had in common for angina.
## [1] "Age" "Sex" "School" "Stroke" "Diabetes" "Smoking" "Alcohol"
## [8] "Leisure"
## [1] "Age" "Stroke" "School" "Smoking" "Sex"
## [6] "Alcohol" "Diabetes" "Leisure" "Marriage" "Vegetables"
## [1] "Age2" "Age3" "Age4" "Age5" "Sex2"
## [6] "Employment4" "Smoking2" "Alcohol2" "Fruits2" "Vegetables2"
## [11] "Leisure2"
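The overlap below was assembled by hand, but it can also be computed as a strict intersection. A minimal sketch using the three lists printed above (hard-coded here for illustration); the regression coefficient names first need their factor-level suffixes stripped:

```r
borutaVars <- c("Age","Sex","School","Stroke","Diabetes","Smoking","Alcohol","Leisure")
rfeVars    <- c("Age","Stroke","School","Smoking","Sex","Alcohol",
                "Diabetes","Leisure","Marriage","Vegetables")
coefNames  <- c("Age2","Age3","Age4","Age5","Sex2","Employment4",
                "Smoking2","Alcohol2","Fruits2","Vegetables2","Leisure2")
# Coefficients are per factor level (e.g. "Age2"); drop the trailing
# digits to recover the underlying variable names.
regVars <- unique(gsub("[0-9]+$", "", coefNames))
Reduce(intersect, list(borutaVars, rfeVars, regVars))
# -> "Age" "Sex" "Smoking" "Alcohol" "Leisure"
```

Note that the strict intersection also retains Sex, which the curated four-variable overlap omits.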
anginaOverlap <- c("Age","Smoking","Alcohol","Leisure") # For predicting angina, these features are common to all of the feature selection methods used.
Next, I will apply random forest and recursive feature elimination to stroke and diabetes.
BorutaStroke <- Boruta(Stroke ~., data = health, doTrace=0)
BorutaStroke
## Boruta performed 99 iterations in 10.0384 secs.
## 4 attributes confirmed important: Age, Angina, Diabetes, School;
## 7 attributes confirmed unimportant: Alcohol, Employment, Fruits,
## HeartAttack, Marriage and 2 more;
## 4 tentative attributes left: Income, Leisure, Sex, Smoking;
df_long <- tidyr::gather(as.data.frame(BorutaStroke$ImpHistory), feature, measurement)
plot_ly(df_long, y = ~measurement, color = ~feature, type = "box") %>%
layout(title="Box-and-whisker Plots across all Features",
xaxis = list(title="Features"),
yaxis = list(title="Importance"),
       showlegend=F)
BorutaStrokeVars <- getSelectedAttributes(BorutaStroke)
control<-rfeControl(functions = rfFuncs, method = "cv", number=10)
rf.trainStroke <- rfe(healthTrain[,-10], healthTrain[,10],sizes=c(10, 15), rfeControl=control)
rf.trainStroke
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 10 0.9717 0.8287 0.01582 0.09457 *
## 15 0.9701 0.8056 0.02038 0.14382
##
## The top 5 variables (out of 10):
## Age, Angina, Diabetes, School, Income
## [1] "Age" "School" "Angina" "Diabetes"
## [1] "Age" "Angina" "Diabetes" "School" "Income"
## [6] "Leisure" "Sex" "Fruits" "Smoking" "Vegetables"
## [1] "Age2" "Age3" "Race3" "Employment3" "Employment4"
## [6] "Employment5" "Employment8" "Income2" "Income3" "Income4"
## [11] "Income5" "Smoking2" "Vegetables2" "Leisure2"
strokeOverlap <- c("Age","Income") # For predicting stroke, these features are common to all of the feature selection methods used.
BorutaDiabetes <- Boruta(Diabetes ~., data = health, doTrace=0)
BorutaDiabetes
## Boruta performed 99 iterations in 19.82274 secs.
## 7 attributes confirmed important: Age, Alcohol, Angina, Leisure,
## School and 2 more;
## 5 attributes confirmed unimportant: Employment, HeartAttack, Income,
## Marriage, Race;
## 3 tentative attributes left: Fruits, Sex, Vegetables;
df_long <- tidyr::gather(as.data.frame(BorutaDiabetes$ImpHistory), feature, measurement)
plot_ly(df_long, y = ~measurement, color = ~feature, type = "box") %>%
layout(title="Box-and-whisker Plots across all Features",
xaxis = list(title="Features"),
yaxis = list(title="Importance"),
       showlegend=F)
BorutaDiabetesVars <- getSelectedAttributes(BorutaDiabetes)
control<-rfeControl(functions = rfFuncs, method = "cv", number=10)
rf.trainDiabetes <- rfe(healthTrain[,-11], healthTrain[,11],sizes=c(10, 15), rfeControl=control)
rf.trainDiabetes
##
## Recursive feature selection
##
## Outer resampling method: Cross-Validated (10 fold)
##
## Resampling performance over subset size:
##
## Variables Accuracy Kappa AccuracySD KappaSD Selected
## 10 0.7767 0.3557 0.04245 0.1152
## 15 0.7918 0.3599 0.03677 0.1260 *
##
## The top 5 variables (out of 15):
## School, Age, Smoking, Leisure, Angina
## [1] "Age" "School" "Angina" "Stroke" "Smoking" "Alcohol" "Leisure"
## [1] "School" "Age" "Smoking" "Leisure" "Angina"
## [6] "Stroke" "Fruits" "Vegetables" "Race" "Income"
## [11] "Sex" "Marriage" "Employment" "Alcohol" "HeartAttack"
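The p-value table below has one row per non-reference level of the four-level Diabetes outcome, which is consistent with a multinomial logistic fit. The model object itself is not echoed in this chunk; as a hedged sketch, one common way to produce such a table uses `nnet::multinom` with two-sided Wald z-tests:

```r
library(nnet)
# Multinomial logit for the 4-level Diabetes outcome; rows 2-4 of the
# resulting p-value table correspond to the non-reference outcome levels.
diabFit <- multinom(Diabetes ~ ., data = health, trace = FALSE)
zStats  <- summary(diabFit)$coefficients / summary(diabFit)$standard.errors
pVals   <- 2 * (1 - pnorm(abs(zStats)))  # two-sided z-test p-values
pVals
```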
## (Intercept) Age2 Age3 Age4 Age5 Age6 Sex2
## 2 0.8300938 0.8365730 0.8715631 0.3617975 0.9252884 0.9371241 0.849863168
## 3 0.9161932 0.9354879 0.9285228 0.9118443 0.8978788 0.8858409 0.001200309
## 4 0.9610899 0.9520612 0.9464598 0.9338014 0.9302537 0.9168585 0.151028005
## Race2 Race3 Race4 Race5 Race9 School3
## 2 0.01170138 0.94702951 0.9552676 0.9529034 0.0179512756 0.000000000
## 3 0.47262325 0.06353354 0.8998868 0.9876607 0.0344455047 0.090496934
## 4 0.94445321 0.00000000 0.9603446 0.9644091 0.0008808087 0.006183338
## School4 School5 School6 Marriage2 Marriage3 Marriage4
## 2 0.991020355 0.9892317103 9.851238e-01 0.2396277 0.04173881 0.8924240
## 3 0.016369252 0.0006557862 1.605218e-05 0.7302569 0.99697170 0.9629989
## 4 0.002000961 0.0002484598 7.556029e-06 0.3576665 0.58010847 0.0000000
## Marriage5 Marriage6 Employment2 Employment3 Employment4 Employment5
## 2 0.29718413 0.2590550 0.9911042 0.7491483 0.9585426 0.6834340
## 3 0.02160503 0.9108186 0.9963687 0.3789144 0.2287452 0.3691272
## 4 0.98248725 0.9440428 0.9490933 0.2843032 0.9682773 0.7030464
## Employment6 Employment7 Employment8 Income2 Income3 Income4 Income5
## 2 0.1021366 0.4995564 0.1851681 0.5195646 0.02998857 0.8426481 0.3835821
## 3 0.4871631 0.7790623 0.1483092 0.9359924 0.35346933 0.4153050 0.5345033
## 4 0.9633968 0.3798384 0.4934006 0.5589069 0.41420468 0.6063436 0.2499176
## Income9 HeartAttack2 Angina2 Stroke2 Smoking2 Smoking9
## 2 0.9306650 0.9267860 0.1821371 0.09587978 6.203503e-01 0.7180547
## 3 0.5498817 0.9697917 0.8205398 0.98518933 2.220446e-16 0.7902117
## 4 0.2294117 0.8619407 0.9408996 0.94655528 3.923081e-08 0.9536220
## Alcohol2 Alcohol9 Fruits2 Fruits9 Vegetables2 Vegetables9
## 2 4.982134e-01 0.3153702 6.871306e-01 0.08208862 0.0734784985 0.1994067
## 3 6.564649e-07 0.2498937 2.233832e-06 0.49803016 0.0001578317 0.1647350
## 4 1.117346e-02 0.9541356 1.856812e-02 0.56349094 0.0803935932 0.9512140
## Leisure2 Leisure9
## 2 7.075133e-02 0.4469572
## 3 5.453415e-13 0.4532258
## 4 1.707871e-02 0.8249915
Each of the classification methods achieved high accuracy in predicting whether a given patient had one of the chronic conditions. However, the cohorts were unbalanced, with healthy patients far outnumbering those with a chronic condition; future work should attempt to collect more balanced cohorts. The demographic and behavioral variables important for predicting angina were age, smoking, alcohol, and leisure; for stroke, age and income; and for diabetes, smoking, education, alcohol, and leisure. These variables are based on the overlap among the three feature selection methods, but it may also be reasonable to include variables shared by only two of the methods. For example, smoking was identified as an important predictor of stroke by logistic regression and RFE, but not by the random forest algorithm. Future work would also benefit from including more behaviors, both protective and harmful, for the given conditions.
Thank you to Dr. Dinov at the University of Michigan for providing the data-set.
Virani SS, Alonso A, Aparicio HJ, Benjamin EJ, Bittencourt MS, Callaway CW, et al. Heart disease and stroke statistics—2021 update: a report from the American Heart Association. Circulation. 2021;143:e254–e743.